Add DataDome server-side validation and multi-backend routing#470
Add DataDome server-side validation and multi-backend routing#470
Conversation
aram356
left a comment
There was a problem hiding this comment.
Summary
Adds DataDome server-side bot validation and multi-backend routing. The backend router design is solid with good test coverage. However, the build.rs changes break environment variable overrides at build time, and several runtime issues need attention.
Findings
🔧 Must fix
- build.rs uses
from_tomlinstead offrom_toml_and_env: env var overrides silently ignored at build time (build.rs:54) rerun_if_changedmisses integration env vars:Settings::default()has empty IntegrationSettings HashMap (build.rs:32)- Sampling hash has poor distribution: byte-sum hash produces identical values for IPs with same bytes in different order (datadome.rs:506)
validation_timeout_msconfigured but never applied: misleading config field (datadome.rs:148)
❓ Questions
- DataDome validation on every request: static assets (CSS/JS/images) all trigger a 200ms HTTP call — intentional? (publisher.rs:276)
backends.tomlin open-source repo: ~175 production domains routing to a specific origin — should this be private?
🤔 Worth addressing
BackendRouter::newerrors silently swallowed via.ok()(publisher.rs:237)- Regex patterns panic at request time, not startup (backend_router.rs:66)
- Domain index silently overwrites on collision (backend_router.rs:120)
- No tests for
validate_requestlogic (~80 lines, zero coverage)
🌱 Future considerations
static OnceLockpatterns prevent test isolation (publisher.rs:227, datadome.rs:751)extract_host(datadome.rs:301) uses fragiletrim_start_matchesparsing whenurlcrate is already a dependency. Also,split('/').next()on a non-empty string never returnsNone, sounwrap_or("api-js.datadome.co")is unreachable with a misleading hardcoded domain.
CI Status
- Analyze (JS/TS): PASS
- Rust build/test/clippy/fmt: Not run
crates/common/build.rs
Outdated
| // Merge base TOML with environment variable overrides and write output | ||
| let settings = settings::Settings::from_toml_and_env(&toml_content) | ||
| // For build time: use from_toml to parse with environment variables | ||
| let mut settings = settings::Settings::from_toml(&toml_content) |
There was a problem hiding this comment.
🔧 from_toml() does not read environment variables — there's even a test proving this (test_from_toml_ignores_env_vars). The original build.rs used from_toml_and_env() to merge TRUSTED_SERVER__* env vars into the binary config. This silently breaks deployments that set secrets (API keys, origin URLs) via env vars at build time.
let mut settings = settings::Settings::from_toml_and_env(&toml_content)
.expect("Failed to parse settings at build time");| println!("cargo:rerun-if-changed={}", BACKENDS_CONFIG_PATH); | ||
|
|
||
| // Create a default Settings instance and convert to JSON to discover all fields | ||
| let default_settings = settings::Settings::default(); |
There was a problem hiding this comment.
🔧 Settings::default() returns an empty IntegrationSettings HashMap, so collect_env_vars will never discover any TRUSTED_SERVER__INTEGRATIONS__* env vars. Changing e.g. TRUSTED_SERVER__INTEGRATIONS__DATADOME__SERVER_SIDE_KEY won't trigger a rebuild.
The old build.rs used an always-rebuild sentinel (_always_rebuild_sentinel_) specifically because integration keys can't be enumerated. Consider restoring that approach.
| // Sample rate check: only validate sample_rate% of requests | ||
| if self.config.sample_rate < 100 { | ||
| // Use client IP hash for deterministic sampling | ||
| let hash = client_ip.bytes().fold(0u32, |acc, b| acc.wrapping_add(u32::from(b))); |
There was a problem hiding this comment.
🔧 This sums ASCII byte values — IPs like "1.2.3.4" and "4.3.2.1" produce identical hashes. IPs in the same /24 subnet cluster together since only the last octet differs. Sampling will be very non-uniform.
let hash = client_ip.bytes().fold(0u32, |acc, b| {
acc.wrapping_mul(31).wrapping_add(u32::from(b))
});| /// Lower timeout = faster fail-open on `DataDome` API issues | ||
| #[serde(default = "default_validation_timeout_ms")] | ||
| #[validate(range(min = 50, max = 1000))] | ||
| pub validation_timeout_ms: u64, |
There was a problem hiding this comment.
🔧 This field is validated (50-1000ms range) but never applied — the comment at line 564 says Fastly doesn't support per-request timeouts. Users setting validation_timeout_ms = 100 would expect it to take effect. Either wire it into BackendConfig timeout support or remove the field to avoid misleading configuration.
|
|
||
| // DataDome server-side validation (if enabled) | ||
| // This validates requests at the edge before they reach the origin | ||
| match crate::integrations::datadome::validate_request_server_side(settings, &req) { |
There was a problem hiding this comment.
❓ This runs on every proxied request — CSS, JS, images, fonts all trigger a DataDome HTTP call (200ms timeout). Most bot protection validates only document/navigation requests. Is this intentional? If so, worth a comment explaining why.
| fn compiled_regex(&self) -> Option<&Regex> { | ||
| if let (Some(regex_cell), Some(pattern)) = (&self.path_regex, &self.path_regex_pattern) { | ||
| Some(regex_cell.get_or_init(|| { | ||
| Regex::new(pattern).expect("path_regex pattern should be valid") |
There was a problem hiding this comment.
🤔 Invalid regex in config will expect() panic on the first matching request, not at startup. BackendRouter::new() doc says it returns errors for invalid patterns but doesn't actually validate them. Consider compiling the regex eagerly in the constructor and returning the error there.
| for (idx, backend) in backends.iter().enumerate() { | ||
| for domain in &backend.domains { | ||
| let normalized = normalize_domain(domain); | ||
| domain_index.insert(normalized, idx); |
There was a problem hiding this comment.
🤔 If two backends claim the same domain (or domains that normalize to the same value), the second silently overwrites the first. Could cause hard-to-debug routing. Consider detecting duplicates and logging a warning.
| /// Unique identifier for logging/debugging (optional). | ||
| #[serde(default)] | ||
| pub id: Option<String>, | ||
| /// Backend-specific timeouts (unused - for future use). |
There was a problem hiding this comment.
♻️ ~30 lines of code for "unused - for future use". Consider removing until actually wired into BackendConfig.
| /// assert_eq!(normalize_domain("sub.example.com"), "sub.example.com"); | ||
| /// ``` | ||
| #[must_use] | ||
| pub fn normalize_domain(domain: &str) -> String { |
There was a problem hiding this comment.
⛏ Allocates twice on every request (to_lowercase() then to_string() after trimming). Domains stored in the index are already normalized at construction time, but the incoming host is re-normalized on every call. Could use Cow<str> to avoid allocation when no transformation is needed.
| pub struct BackendRoute { | ||
| pub origin_url: String, | ||
| pub certificate_check: bool, | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
⛏ If the field is unused, consider removing it rather than suppressing the warning.
aram356
left a comment
There was a problem hiding this comment.
Summary
Adds server-side DataDome bot validation at the edge and a multi-backend routing system that selects origins by domain/path. Both features default to disabled with safe fail-open behavior.
Findings
🔧 Blockers (P0)
- build.rs no longer applies env var overrides:
from_toml_and_envwas changed tofrom_toml, so env vars likeTRUSTED_SERVER__INTEGRATIONS__DATADOME__SERVER_SIDE_KEYare never merged. Thererun_if_changedenv var enumeration is now dead code. (crates/common/build.rs:54)
🔧 High (P1)
- Sampling hash has poor distribution: Byte-sum hash for IPv4 sampling produces collisions for IPs with same octets in different order. Use
DefaultHasheror similar. (crates/common/src/integrations/datadome.rs:506) - Validation runs on every request: Static assets (images, CSS, JS, fonts) all trigger a blocking HTTP call to DataDome. Should filter to document/navigational requests only. (crates/common/src/publisher.rs:276)
- Duplicate domain silently overwrites: If two backends declare the same domain,
HashMap::insertkeeps the last one with no warning. (crates/common/src/backend_router.rs:120)
🤔 Medium (P2)
validation_timeout_mshas no effect: Fastly Compute doesn't support per-request timeouts, but the field is validated and documented as if it works. Misleading for operators. (crates/common/src/integrations/datadome.rs:148)- Hardcoded fallback in
extract_host(crates/common/src/integrations/datadome.rs:306):.unwrap_or("api-js.datadome.co")— if URL parsing fails, this silently uses a hardcoded domain instead of the configuredapi_origin. Should propagate the error or derive the fallback from config. BackendTimeoutsunused: Declared "for future use" — adds config surface with no current benefit. (crates/common/src/settings.rs:364)- Inconsistent cookie trimming:
trim_start()in publisher.rs vstrim()in datadome.rs for the same pattern. (crates/common/src/publisher.rs:216) - No test coverage for
validate_request: Sampling, cookie extraction, request building, response handling, and fail-open/closed logic are all untested. (crates/common/src/integrations/datadome.rs:751)
⛏ Low (P3)
- 175 hardcoded domains in backends.toml: Customer-specific data baked into the binary. Consider whether this should come from a config store. (crates/fastly/backends.toml)
CI Status
- Analyze (JS/TS): PASS
- fmt: Not run
- clippy: Not run
- rust tests: Not run
- js tests: Not run
crates/common/build.rs
Outdated
| // Merge base TOML with environment variable overrides and write output | ||
| let settings = settings::Settings::from_toml_and_env(&toml_content) | ||
| // For build time: use from_toml to parse with environment variables | ||
| let mut settings = settings::Settings::from_toml(&toml_content) |
There was a problem hiding this comment.
🔧 P0 — Env var overrides no longer applied at build time
This changed from from_toml_and_env → from_toml, which means environment variables like TRUSTED_SERVER__INTEGRATIONS__DATADOME__SERVER_SIDE_KEY are no longer merged at build time. The rerun_if_changed function below still enumerates env vars, but they're never applied — making it dead code.
Fix: Restore from_toml_and_env, or if this is intentional, remove the rerun_if_changed env var enumeration and document why env vars are no longer baked in.
| // Sample rate check: only validate sample_rate% of requests | ||
| if self.config.sample_rate < 100 { | ||
| // Use client IP hash for deterministic sampling | ||
| let hash = client_ip.bytes().fold(0u32, |acc, b| acc.wrapping_add(u32::from(b))); |
There was a problem hiding this comment.
🔧 P1 — Sampling hash has poor distribution for IPv4
Simple byte-sum produces identical hashes for IPs with the same octets in different order (1.2.3.4 == 4.3.2.1). With % 100, sampling will be far from uniform.
Fix:
use std::hash::{Hash, Hasher};
let mut hasher = std::collections::hash_map::DefaultHasher::new();
client_ip.hash(&mut hasher);
let sample = (hasher.finish() % 100) as u8;|
|
||
| // DataDome server-side validation (if enabled) | ||
| // This validates requests at the edge before they reach the origin | ||
| match crate::integrations::datadome::validate_request_server_side(settings, &req) { |
There was a problem hiding this comment.
🔧 P1 — Validation runs on every request including static assets
validate_request_server_side is called unconditionally — images, CSS, JS, fonts all trigger a blocking HTTP call to DataDome (200ms default timeout). This:
- Adds latency to every static asset
- Multiplies DataDome API usage/cost
- Could hit DataDome rate limits
DataDome validation is typically needed only for document/HTML requests.
Fix: Filter by request path or accept header to skip known static assets, or at minimum only validate navigational requests.
| for (idx, backend) in backends.iter().enumerate() { | ||
| for domain in &backend.domains { | ||
| let normalized = normalize_domain(domain); | ||
| domain_index.insert(normalized, idx); |
There was a problem hiding this comment.
🔧 P1 — Duplicate domain silently overwrites
If two backends declare the same domain, HashMap::insert silently keeps the last one. This can cause hard-to-debug routing issues in production.
Fix: Log a warning or return an error:
if let Some(existing) = domain_index.insert(normalized.clone(), idx) {
log::warn!("Domain '{}' appears in multiple backends (index {} and {})", normalized, existing, idx);
}| /// Lower timeout = faster fail-open on `DataDome` API issues | ||
| #[serde(default = "default_validation_timeout_ms")] | ||
| #[validate(range(min = 50, max = 1000))] | ||
| pub validation_timeout_ms: u64, |
There was a problem hiding this comment.
🤔 P2 — validation_timeout_ms has no effect
This field is validated and documented, but the comment at line 564 says Fastly Compute doesn't support per-request timeouts. Operators will configure this expecting it to work.
Fix: Either remove the field, or log a warning at startup when it's set to a non-default value so operators aren't misled.
| /// Backend-specific timeouts (unused - for future use). | ||
| #[serde(default)] | ||
| #[validate(nested)] | ||
| pub timeouts: BackendTimeouts, |
There was a problem hiding this comment.
🤔 P2 — BackendTimeouts unused
The comment says "unused - for future use." This adds config surface area and serialization overhead with no current benefit. Per project conventions: don't design for hypothetical future requirements.
crates/common/src/publisher.rs
Outdated
| .map(|cookies| { | ||
| cookies | ||
| .split(';') | ||
| .any(|cookie| cookie.trim_start().starts_with(&synthetic_cookie_prefix)) |
There was a problem hiding this comment.
🤔 P2 — Inconsistent cookie trimming
trim_start() here vs trim() in datadome.rs:521 for the same cookie-parsing pattern. Standardize on trim() for robustness.
| ) -> Result<bool, Report<TrustedServerError>> { | ||
| // Cache the integration after the first call. Settings are fixed at startup, | ||
| // so first-caller-wins is correct — the same binary serves all requests. | ||
| static INTEGRATION: OnceLock<Option<Arc<DataDomeIntegration>>> = OnceLock::new(); |
There was a problem hiding this comment.
🤔 P2 — No test coverage for validate_request logic
The entire server-side validation flow — sampling, cookie extraction, request building, response status handling, fail-open/closed — has no unit tests. The #[cfg(test)] in get_client_ip shows testing was considered but the actual validation path is uncovered.
Adds two complementary features for edge-level request control: DataDome server-side validation: - New server_side_enabled config flag on DataDome integration - validate_request_server_side() calls the DataDome HTTP API at the edge before proxying to origin, blocking detected bots with 403 - Configurable: API key, validation endpoint, timeout, fail-open behavior, and sample rate for gradual rollout - Integration cached in OnceLock -- built once, reused across requests Multi-backend routing: - New BackendRouter in backend_router.rs selects origin URL based on request host and path (domain index + path prefix/regex patterns) - Arena Group backend config lives in crates/fastly/backends.toml and is merged into the embedded binary config at build time by build.rs - trusted-server.toml contains only generic schema documentation - Synthetic cookie is only set on responses when not already present
c9e3ce1 to
5cace16
Compare
Adds two complementary features for edge-level request control:
DataDome server-side validation:
Multi-backend routing:
Summary
Changes
Closes
Closes #317
Test plan
cargo test --workspacecargo clippy --all-targets --all-features -- -D warningscargo fmt --all -- --checkcd crates/js/lib && npx vitest runcd crates/js/lib && npm run formatcd docs && npm run formatcargo build --bin trusted-server-fastly --release --target wasm32-wasip1fastly compute serveChecklist
unwrap()in production code — useexpect("should ...")tracingmacros (notprintln!)